This document is intended both as an open-source record of analyses conducted for one study in our paper entitled “The Effects of Metacognition in Survey Research: Experimental, Cross-Sectional, and Content-Analytic Evidence”, which was recently published in Public Opinion Quarterly, and as an introductory tutorial to Structural Topic Modeling (STM). If used for the latter purpose, this document is offered under the same license that governs the dataset used herein (available here). Chiefly, this document is provided “AS IS”, without warranty of any kind, express or implied. That said, if you encounter any issues, please feel free to email the first author using the link at the top of this document.
STMs are described in some detail in the paper. Additional resources are available here and here. Briefly, STMs are a form of unsupervised machine learning used to categorize a corpus of text documents. What differentiates STMs from a “bag of words” or latent Dirichlet allocation approach is the ability to specify document metadata as covariates in the model.
We encourage you to read the paper linked above; a quick synopsis is required to understand the project we undertook. We are interested in whether the complexity of the language used in public opinion polling differentially affects metacognitive processes, and thus methodologically pertinent outcomes such as “don’t know” reporting, or abstaining from self-reporting. In study 1, we conducted a survey experiment among a nationally representative sample of U.S. adults. Participants were randomly assigned to either an “easy”- or “difficult”-language condition in which we varied the objective complexity of the language used to convey the public opinion questions. In study 2, we collected survey instruments from the Roper Center’s database of public opinion polls (data and analyses coming soon; keep an eye on this website: https://github.com/Matt-Sweitzer/). Using top-line results, we compared question difficulty to rates of “don’t know” reporting. Having found significant results for both studies, we turned our attention to the public opinion polling industry more broadly to see if language complexity varied systematically – suggesting that the methodological problem posed by complex language may be widespread and unconsidered. One of the ways we looked for systematic variance in question complexity was by using an STM to categorize the questions into topics. Given how large the dataset for this latter study was, a computational approach to coding topics made this component of our analyses much more feasible.
Before we begin, here is the most up-to-date information about the R session used to generate this file. To run this yourself, you will need to install the R packages stm, stringr, tm, wordcloud, and SnowballC. The latter two packages are only required to plot the wordcloud below and can be skipped for the topic model.
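If any of these packages are missing from your library, they can be installed from CRAN; a minimal sketch (package names taken from the session info below):

```r
# Install the packages used in this tutorial (skip any already installed)
install.packages(c("stm", "stringr", "tm", "wordcloud", "SnowballC"))

# Load them for the current session
library(stm)
library(stringr)
library(tm)
library(wordcloud)
library(SnowballC)
```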
[1] "Wednesday October 16, 2019 - 12:28:26 AM EDT"
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.25 wordcloud_2.6 RColorBrewer_1.1-2
[4] tm_0.7-6 NLP_0.2-0 stringr_1.4.0
[7] stm_1.3.3 SnowballC_0.6.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 xml2_1.2.2 magrittr_1.5
[4] lattice_0.20-38 highr_0.8 tools_3.6.1
[7] parallel_3.6.1 grid_3.6.1 data.table_1.12.2
[10] xfun_0.10 htmltools_0.3.6 yaml_2.2.0
[13] digest_0.6.21 Matrix_1.2-17 base64enc_0.1-3
[16] evaluate_0.14 slam_0.1-45 rmarkdown_1.16
[19] stringi_1.4.3 compiler_3.6.1 jsonlite_1.6
The data for this study is available from my GitHub page.
data<-read.csv(url("https://raw.githubusercontent.com/Matt-Sweitzer/Metacognition_Surveys/master/Study_3/Data/2016PollsFinal.csv"))
Let’s take a look at the head of the data frame:
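The table below can be produced with something like the following (the kable call is an assumption about how the original file rendered its output):

```r
# Print the first six rows; knitr::kable renders them as a markdown table
knitr::kable(head(data))
```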
| Date | Poll | QuestionNum | QuestionWord | Region | N | SurveyMethod | Likely.vs.All | Ease | Grade | National | Geo | SurveyNum | Topic |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2/5/16 | Quinnipiac University | 1 | If the Republican primary for President were being held today, and the candidates were Jeb Bush, Ben Carson, Chris Christie, Ted Cruz, Carly Fiorina, Jim Gilmore, John Kasich, Marco Rubio, and Donald Trump, for whom would you vote? | US | 1125 | Phone | Registered | 34.69 | 17.86 | 1 | National | 1 | 6 |
| 2/5/16 | Quinnipiac University | 2 | Is your mind made up, or do you think you might change your mind before the primary? | US | 1125 | Phone | Registered | 90.05 | 4.92 | 1 | National | 1 | 2 |
| 2/5/16 | Quinnipiac University | 3 | Are there any of these candidates you would definitely not support for the Republican nomination for president: Bush, Carson, Christie, Cruz, Fiorina, Gilmore, Kasich, Rubio, or Trump? | US | 1125 | Phone | Registered | 36.47 | 11.53 | 1 | National | 1 | 6 |
| 2/5/16 | Quinnipiac University | 4 | If the Democratic primary for President were being held today, and the candidates were Hillary Clinton and Bernie Sanders, for whom would you vote? | US | 1125 | Phone | Registered | 41.48 | 13.44 | 1 | National | 1 | 6 |
| 2/5/16 | Quinnipiac University | 5 | Is your mind made up, or do you think you might change your mind before the primary? | US | 1125 | Phone | Registered | 90.05 | 4.92 | 1 | National | 1 | 2 |
| 2/5/16 | Quinnipiac University | 6 | If the election for President were being held today, and the candidates were Hillary Clinton the Democrat and Donald Trump the Republican, for whom would you vote? | US | 1125 | Phone | Registered | 38.43 | 14.61 | 1 | National | 1 | 6 |
These variables can be interpreted as follows:
- Date: character, the date the results were published, or the last date of data collection if a publication date was not available
- Poll: character, the polling firm either responsible for data collection or the sponsor of third-party data collection
- QuestionNum: numeric, indicates the order of questions within the same ballot
- QuestionWord: character, wording of the question from the survey instrument
- Region: character, geographic region of the survey sample
- N: numeric, total sample size
- SurveyMethod: character (factor), method of data collection – options include “online”, “phone”, and “phone/internet”
- Likely.vs.All: character (factor), class of respondents targeted in the sample – options include “All”, “Likely”, and “Registered”
- Ease: numeric, Flesch reading ease score of QuestionWord – calculated using the koRpus package in R
- Grade: numeric, Flesch-Kincaid grade level of QuestionWord – calculated using the koRpus package in R
- National: numeric (factor), dummy variable collapsing Region – 1 == “US”, else 0
- Geo: character (factor), alternative collapsing of Region – options include “Community”, “National”, and “State”
- SurveyNum: numeric (factor), indicates which survey instrument each question comes from – should be identical to the Date-Poll combination
- Topic: numeric, the stm topic to which the question had the highest fit, \(\theta\) – this is what we will be estimating here!

Now, let’s begin cleaning up the text corpus to make a basic corpus descriptive figure: a wordcloud. First, let’s take the vector and change it to a corpus class:
questionCorp<-Corpus(VectorSource(data$QuestionWord))
Next, let’s start cleaning up the text. Wordclouds typically omit punctuation. Also, we might expect that the most common words in the English language, articles such as “the” or “a”, would also be common in our corpus – showing them in this figure would not be very informative, so let’s remove those stopwords too.
questionCorp<-tm_map(questionCorp, removePunctuation)
questionCorp<-tm_map(questionCorp, removeWords, stopwords('english'))
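Under the hood, tm’s removePunctuation boils down to a regular-expression substitution; a roughly equivalent base-R sketch (the helper name and sample string are just for illustration):

```r
# Roughly what removePunctuation does: delete every character
# in the POSIX [:punct:] class
strip_punct <- function(x) gsub("[[:punct:]]", "", x)

strip_punct("Is your mind made up, or might you change it?")
# "Is your mind made up or might you change it"
```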
Finally, some in the natural language processing realm advocate for a process called “stemming”. This takes similar words and removes the suffixes that differentiate them so they can be treated as referring to the same thing. For example, “communicate”, “communication”, and “communicating” would all become “communic”. This can be somewhat helpful when those slight differentiations create excess noise in the data, or make for an especially sparse term-document matrix. Some of these stemming algorithms are not particularly robust – your mileage may vary. I will not stem the corpus here, but if you would like to stem your own, uncomment the following line of code:
#questionCorp<-tm_map(questionCorp, stemDocument)
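If you want to preview what the stemmer would do before committing to it, you can apply SnowballC’s wordStem (the function that tm’s stemDocument wraps) to a few sample words:

```r
library(SnowballC)

# Porter-stem a handful of related words; they collapse to a common stem
wordStem(c("communicate", "communication", "communicating"))
```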
Great! Now that the corpus is cleaned, we can take a look at the wordcloud:
par(mar=c(0,0,0,0))
wordcloud(questionCorp, max.words=1000, random.order=FALSE)
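If you would rather inspect exact counts than eyeball the cloud, the same cleaned corpus can be tabulated via a term-document matrix; a minimal sketch (fine for a corpus of this size, though as.matrix can be memory-hungry on very large corpora):

```r
# Build a term-document matrix from the cleaned corpus
tdm <- TermDocumentMatrix(questionCorp)

# Sum each term's frequency across documents and sort descending
freqs <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# The ten most frequent terms in the corpus
head(freqs, 10)
```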